
ego4d-eccv2022-solutions

This repository contains our solutions to the Ego4D challenges at the ECCV 2022 workshop.

Technical report

Ego4D Slides (in Chinese) | Ego4D Solutions (in Chinese)


📢News

(2023/10/10) We use the weights pre-trained on the verb subset for the TAL track of the Perception Test challenge, which brings a 2-point performance improvement. The extracted features can be downloaded here.

(2023/07/13) We release the ViT-L weights fine-tuned on the Ego4D-MQ dataset.

(2023/04/11) 🚀We release the leading model for the SCOD task.

(2022/12/11) 🚀🚀We release the code and checkpoints for pretraining, the FHP task, and the SCOD task.

(2022/12/01) 🚀The VideoMAE features for MQ and NLQ are released.

(2022/11/17) 🔄The repository is created.

Catalog

  • Code for the feature extractor
  • Verb/noun features (VideoMAE-L) for MQ and NLQ
  • Code for pretraining
  • Code for STA
  • Code for Hands (FHP)
  • Code and checkpoints for SCOD

Video Features for MQ and NLQ

We provide video features extracted by VideoMAE-L pretrained on the verb and noun subsets.

| Feature | Baidu Netdisk | Zenodo |
| --- | --- | --- |
| MQ (verb) | Download (code: sxda) | Download |
| NLQ (verb) | Download (code: teod) | Download |
| NLQ (noun) | Download (code: wrop) | Download |

You can find more details in our technical report.
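
In case it helps, here is a minimal sketch of loading one of the extracted feature files. The per-video file layout, the `.pt` extension, and the `[num_clips, 1024]` tensor shape are assumptions (ViT-L features are 1024-dimensional), not confirmed details of the archives; check the downloaded files for the actual format.

```python
# Hedged sketch: load one extracted VideoMAE-L feature file.
# Assumptions (not confirmed by the repo): one .pt file per video UID,
# holding a float tensor of roughly shape [num_clips, 1024].
import torch

feature_path = "mq_verb_features/<video_uid>.pt"  # placeholder path
features = torch.load(feature_path, map_location="cpu")
print(features.shape, features.dtype)  # e.g. torch.Size([T, 1024]) torch.float32
```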

Pretraining

Our training strategy is based on the vanilla method and is easy to follow. We use the VideoMAE codebase for training and validation. Before training, please follow its instructions to install the Python environment. For rapid development, we split the training annotations filtered by EgoVLP; the resulting annotation files are available here. We release the checkpoints in the table below.

| Method | Pretrain | Resolution | Subset | Top-1 | Top-5 | Weights |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-L | K700 | 224x224 | verb | 52.51 | 86.05 | Download |
| ViT-L | K700 | 224x224 | noun | 33.41 | 85.51 | Download |
| ViT-L | K700+verb | 224x224 | MQ | - | - | Download |
| UniFormer-B | K600 | 320x320 | verb | 49.30 | 83.61 | Download |

Note: for the ViT-L weights fine-tuned on the MQ task, some keys of the state_dict may need to be modified to match the model code.
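
The exact mismatch depends on how the checkpoint was saved, but a remap along the following lines usually suffices. This is a hedged sketch: the stripped prefixes ("module.", "encoder.") and the checkpoint file name are common examples, not the repository's confirmed key names; inspect the missing/unexpected keys reported by load_state_dict first.

```python
# Hedged sketch: adapt checkpoint keys to the model's expected state_dict.
# The prefixes stripped below are illustrative assumptions; adjust them after
# inspecting what load_state_dict reports for the released weights.
import torch


def load_adapted_checkpoint(model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Checkpoints may wrap the weights under a "model" or "module" entry.
    state_dict = ckpt.get("model", ckpt.get("module", ckpt))

    remapped = {}
    for key, value in state_dict.items():
        for prefix in ("module.", "encoder."):
            if key.startswith(prefix):
                key = key[len(prefix):]
        remapped[key] = value

    missing, unexpected = model.load_state_dict(remapped, strict=False)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
    return model
```

Here `model` would be the ViT-L instance built by the VideoMAE codebase, and `ckpt_path` the downloaded MQ weight file.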

Training

We provide the training script for SLURM mode. If you want to use PyTorch-DDP mode, you can use the scripts in scripts/pytorch_ddp.

```bash
bash scripts/slurm/ego4d_verb_slurm_pretrain_vitl_k400.sh
```

In the script, you need to set the appropriate OUTPUT_DIR and MODEL_PATH.

STA

Training

We use the ViT-Large model for the STA task.

```bash
sh scripts/slurm/sta_train.sh
```

Validation

```bash
cd forecasting_eval
sh sta_val.sh
```

FHP

Training

We train the FHP task with UniFormer-B and the weights pretrained on the Ego4D verb subset. We provide the training script for SLURM mode. If you want to use PyTorch-DDP mode, you can use the scripts in scripts/pytorch_ddp.

```bash
bash scripts/slurm/ego4d_hands_uniformer.sh
```

In the script, you need to set the appropriate OUTPUT_DIR and MODEL_PATH.

Validation

We also provide a script for validation and testing. You can launch the script below to evaluate a specific checkpoint's performance.

```bash
bash scripts/slurm/ego4d_hands_uniformer_val.sh
```

In the script, you need to set the appropriate OUTPUT_DIR, MODEL_PATH, --test_subset, and --test_num_segment.

SCOD

Our detection code for SCOD is developed on top of MMDetection.

Download the converted annotations for SCOD: download
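
If you want to sanity-check the download, a quick script like the one below can be used. It assumes the converted annotations follow the COCO JSON layout that MMDetection consumes (images / annotations / categories); the file name is a placeholder.

```python
# Hedged sketch: sanity-check the converted SCOD annotations, assuming a
# COCO-style JSON layout. The path below is a placeholder, not the actual
# file name inside the archive.
import json

with open("annotations/scod_train.json") as f:  # placeholder path
    coco = json.load(f)

print(len(coco["images"]), "images")
print(len(coco["annotations"]), "object annotations")
print([c["name"] for c in coco["categories"]][:10])
```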

We report the performance on the validation set and release the checkpoints in the table below.

| Method | Pretrain | Resolution | AP | AP50 | AP75 | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UniFormer-L | IN-1K | 800-1600/2000 | 24.8 | 44.2 | 24.0 | config | ckpt \| log |
| Swin-L | IN-22K+O365 | 800-1600/2000 | 36.4 | 56.5 | 37.6 | config | ckpt \| log |
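
For a quick single-image check with one of the released checkpoints (as opposed to the distributed evaluation below), MMDetection's high-level API can be used. The checkpoint and image paths in this sketch are placeholders; only the config path comes from the training command below.

```python
# Hedged sketch: single-image inference with a released SCOD detector via
# MMDetection's high-level API (MMDetection 2.x). Checkpoint and image
# paths are placeholders.
from mmdet.apis import init_detector, inference_detector

config = "configs/scod/dino_5scale_uniformer-l_8x2_12e_scod_imagenet1k.py"
checkpoint = "checkpoints/scod_uniformer_l.pth"  # placeholder file name

model = init_detector(config, checkpoint, device="cuda:0")
result = inference_detector(model, "demo/query_frame.jpg")  # placeholder image
model.show_result("demo/query_frame.jpg", result, out_file="demo/scod_pred.jpg")
```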

To train UniFormer-L + DINO on the SCOD training set with 8 GPUs for 12 epochs:

```bash
sh tools/dist_train.sh configs/scod/dino_5scale_uniformer-l_8x2_12e_scod_imagenet1k.py 8
```

To test UniFormer-L + DINO on the SCOD validation set with 8 GPUs:

```bash
sh tools/dist_test.sh configs/scod/dino_5scale_uniformer-l_8x2_12e_scod_imagenet1k.py <ckpt-path> 8 --eval bbox
```

It should give:

```
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.248
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.442
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.240
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.075
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.282
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.638
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.638
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.638
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.054
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.321
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.697
```

🎓Citation

If this work is helpful for your research, please consider citing our technical report.

```bibtex
@article{chen2022ego4d,
  title={InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges},
  author={Chen, Guo and Xing, Sen and Chen, Zhe and Wang, Yi and Li, Kunchang and Li, Yizhuo and Liu, Yi and Wang, Jiahao and Zheng, Yin-Dong and Huang, Bingkun and others},
  journal={arXiv preprint arXiv:2211.09529},
  year={2022}
}
```